date()
## [1] "Mon Nov 27 10:41:04 2023"
Prep packages:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.2
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Load the Boston data from the MASS R package and explore the structure and dimensions of the data:
The dataset contains “Housing Values in Suburbs of Boston”.
More information on the data can be found here:
[https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html]
data("Boston")
A graphical overview of the data and summaries of the variables in the data:
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
pairs(Boston)
cor_matrix <- cor(Boston) %>% round(2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
# ggpairs(Boston, mapping = aes(alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
The distributions of the variables differ. Some, like zn and age, range from zero to a hundred, chas is a binary 0/1 indicator, and others are on completely different scales: tax ranges from 187 to 711 and black from 0.32 to 396.9.
Many of the variables are strongly correlated. For example, lstat and medv have a strong negative correlation (about -0.74), and rad and tax have a strong positive correlation (about 0.91).
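To check such statements against the numbers rather than the plot, the strongest pairwise correlations can be listed directly. This is only a minimal sketch: the names `cor_upper`, `pairs_df` and `top_pairs` are introduced here for illustration and are not part of the original analysis.

```r
library(MASS)
data("Boston")

# correlation matrix; blank out the lower triangle and the diagonal
# so that every pair of variables appears exactly once
cor_upper <- cor(Boston)
cor_upper[lower.tri(cor_upper, diag = TRUE)] <- NA

# long format: one row per variable pair
pairs_df <- as.data.frame(as.table(cor_upper))
names(pairs_df) <- c("var1", "var2", "r")
pairs_df <- pairs_df[!is.na(pairs_df$r), ]

# five strongest correlations by absolute value
top_pairs <- pairs_df[order(-abs(pairs_df$r)), ][1:5, ]
print(top_pairs)
```

Sorting by absolute value keeps strong negative correlations (such as lstat and medv) in the same ranking as strong positive ones (such as rad and tax).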
Standardizing the dataset:
# scale the variables (all numeric)
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Standardizing puts variables measured on very different scales onto a common one. Each variable has been centered and scaled, so in the standardized data every column now has a mean of zero and a standard deviation of one.
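What scale() does with its default arguments can be verified by hand for a single column: subtract the column mean and divide by the column standard deviation. The sketch below uses tax as an arbitrary example; `manual_tax`, `col_means` and `col_sds` are illustrative names.

```r
library(MASS)
data("Boston")

boston_scaled <- scale(Boston)

# manual standardization of one column: (x - mean) / sd
manual_tax <- (Boston$tax - mean(Boston$tax)) / sd(Boston$tax)

# should agree with the corresponding column of scale()
print(all.equal(as.numeric(boston_scaled[, "tax"]), manual_tax))

# after scaling, every column has mean 0 and standard deviation 1
col_means <- colMeans(boston_scaled)
col_sds <- apply(boston_scaled, 2, sd)
print(round(col_means, 10))
print(round(col_sds, 10))
```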
Create a categorical variable of the crime rate in the Boston dataset:
boston_scaled <- as.data.frame(boston_scaled) # we need this too
boston_scaled$crim <- as.numeric(boston_scaled$crim) # we need this too
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
Drop the old crime rate variable from the dataset:
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
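Because the break points are the quartiles of the scaled crime rate, the four categories should hold roughly a quarter of the observations each. A small sketch to confirm this (the names `crim_scaled`, `bins` and `crime_counts` are illustrative only):

```r
library(MASS)
data("Boston")

# scaled crime rate as a plain numeric vector
crim_scaled <- as.numeric(scale(Boston$crim))

# quartile break points: min, 25%, 50%, 75%, max
bins <- quantile(crim_scaled)

# include.lowest = TRUE keeps the minimum value inside the first bin
crime <- cut(crim_scaled, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

crime_counts <- table(crime)
print(bins)
print(crime_counts)
```

Without include.lowest = TRUE the observation with the smallest crime rate would fall outside the first interval and become NA.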
Divide the dataset to train and test sets:
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
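The split above is random, so every run (and every knit) produces a different train and test set. A hedged sketch of a reproducible variant follows. The seed value 42 is an arbitrary assumption and not something used in the original analysis, and floor() is added only to make the intended set size explicit.

```r
library(MASS)
data("Boston")

set.seed(42)  # assumption: any fixed seed makes the split repeatable

boston_scaled <- as.data.frame(scale(Boston))
n <- nrow(boston_scaled)

# 80% of the rows, rounded down to a whole number
ind <- sample(n, size = floor(n * 0.8))

train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]

print(c(train = nrow(train), test = nrow(test)))
```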
Fitting the linear discriminant analysis on the train set & drawing the LDA (bi)plot:
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2574257 0.2524752 0.2326733 0.2574257
##
## Group means:
## zn indus chas nox rm age
## low 0.96645965 -0.9212144 -0.12090214 -0.8535598 0.44810518 -0.8568763
## med_low -0.05384482 -0.2874880 0.03646311 -0.5963858 -0.09670717 -0.3880420
## med_high -0.38506488 0.1384192 0.27216352 0.3564530 0.16717690 0.4042563
## high -0.48724019 1.0170690 -0.08304540 1.0352191 -0.46146119 0.8056106
## dis rad tax ptratio black lstat
## low 0.8376777 -0.6781900 -0.7505003 -0.46001999 0.38227261 -0.764105556
## med_low 0.4389733 -0.5506331 -0.5179629 -0.06867187 0.35518382 -0.145832181
## med_high -0.3478582 -0.3722063 -0.2875995 -0.31213089 0.04010864 0.001686464
## high -0.8453773 1.6386213 1.5144083 0.78135074 -0.80175115 0.886882560
## medv
## low 0.544682470
## med_low 0.007625754
## med_high 0.203482420
## high -0.687937968
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.10648667 0.78594870 -0.79158541
## indus -0.02441030 -0.28650157 0.64263523
## chas -0.08397676 -0.09366272 0.12040410
## nox 0.38471916 -0.60328641 -1.32403626
## rm -0.09623505 -0.12179437 -0.05186723
## age 0.26440199 -0.34572572 -0.20983866
## dis -0.13520685 -0.39589664 0.37986046
## rad 2.90239847 1.10019954 0.13212773
## tax 0.09649070 -0.25594090 0.24211036
## ptratio 0.10980039 0.02099939 -0.17580804
## black -0.14564100 0.03061035 0.17310861
## lstat 0.19284745 -0.26240837 0.44047577
## medv 0.17423744 -0.38744108 -0.21591785
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9484 0.0366 0.0150
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0,
                   x1 = myscale * heads[,choices[1]],
                   y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
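The "Proportion of trace" in the printed model tells how much of the between-group variance each linear discriminant captures. It can be recomputed from the singular values stored in the fitted object. The sketch below refits the model on a fresh random split, so its numbers will differ somewhat from the output printed above. The seed and the name `prop_trace` are assumptions made for illustration.

```r
library(MASS)
data("Boston")

set.seed(42)  # assumption: arbitrary seed, not the split used above

# same preparation as above: scale, bin crim into quartiles, drop crim
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

n <- nrow(boston_scaled)
train <- boston_scaled[sample(n, size = floor(n * 0.8)), ]

lda.fit <- lda(crime ~ ., data = train)

# squared singular values, normalized to sum to one
prop_trace <- lda.fit$svd^2 / sum(lda.fit$svd^2)
names(prop_trace) <- colnames(lda.fit$scaling)
print(round(prop_trace, 4))
```

With four classes there are at most three discriminants, which is why the output has columns LD1 to LD3.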
Saving the crime categories from the test set and then removing the categorical crime variable from the test dataset:
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
Predicting the classes with the LDA model on the test data:
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 8 0 0
## med_low 3 12 9 0
## med_high 1 11 20 0
## high 0 0 0 23
Many observations were predicted correctly: (15+12+20+23) / (15+8+3+12+9+1+11+20+23) = 70/102 = 0.6862745, so 68.6% in total are correct. Many are still predicted wrong, however, especially in the med_low and med_high categories. The model works best for the high category, where all predictions are correct.
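The accuracy does not have to be added up by hand. As a sketch, the counts from the cross-tabulation printed above are typed in as a matrix below. In a live session the same two lines would work directly on the object returned by table(). The names `conf_mat`, `accuracy` and `per_class` are illustrative.

```r
# counts copied from the cross-tabulation above (rows = correct, columns = predicted)
lev <- c("low", "med_low", "med_high", "high")
conf_mat <- matrix(c(15,  8,  0,  0,
                      3, 12,  9,  0,
                      1, 11, 20,  0,
                      0,  0,  0, 23),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(correct = lev, predicted = lev))

# overall accuracy: correctly classified (diagonal) over all observations
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# share of each true class that was predicted correctly
per_class <- diag(conf_mat) / rowSums(conf_mat)

print(round(accuracy, 4))
print(round(per_class, 3))
```

The per-class shares are 15/23 for low, 12/24 for med_low, 20/32 for med_high and 23/23 for high.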
Reloading the Boston dataset and standardizing the dataset:
# from above
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled) # we need this too
Calculating the distances between the observations:
# with euclidean distance
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
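dist() uses the Euclidean distance by default, which can be checked by computing one distance by hand. The sketch below also computes the Manhattan distance for comparison. The names `x`, `d12_manual`, `d12_dist` and `dist_man` are introduced only for this illustration.

```r
library(MASS)
data("Boston")

boston_scaled <- as.data.frame(scale(Boston))
x <- as.matrix(boston_scaled)

# Euclidean distance matrix (the default method of dist())
dist_eu <- dist(boston_scaled)

# distance between observations 1 and 2, computed by hand
d12_manual <- sqrt(sum((x[1, ] - x[2, ])^2))
d12_dist <- as.matrix(dist_eu)[1, 2]
print(c(manual = d12_manual, dist = d12_dist))

# Manhattan distance: sum of absolute coordinate differences
dist_man <- dist(boston_scaled, method = "manhattan")
print(summary(dist_man))
```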
Running k-means algorithm:
# k-means clustering
km <- kmeans(boston_scaled, centers = 3) # trying with 3 clusters to begin with
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
Investigating the optimal number of clusters and running the algorithm again:
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) + geom_line() # ggplot() instead of the deprecated qplot()
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
pairs(boston_scaled[1:5], col = km$cluster)
pairs(boston_scaled[6:10], col = km$cluster)
pairs(boston_scaled[11:14], col = km$cluster)
The optimal number of clusters is where the total within-cluster sum of squares (TWCSS) drops sharply. Choosing it is fairly subjective: one could pick two clusters, because the steepest drop occurs up to that point, or perhaps six, where the descent levels off. We go with two clusters here.
The variable pairs that separate the two groups in the pairs plots include, for instance, crim & zn and crim & nox.
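To make the choice a little less dependent on eyeballing the curve, the relative drop in the total within-cluster sum of squares from one k to the next can be printed. This is a sketch with two assumptions that are not in the original code: nstart = 20 (several random starts per k, to reduce the influence of a poor initialization) and iter.max = 50. `rel_drop` is an illustrative name.

```r
library(MASS)
data("Boston")

set.seed(123)
boston_scaled <- as.data.frame(scale(Boston))
k_max <- 10

# total within-cluster sum of squares for k = 1, ..., k_max
twcss <- sapply(1:k_max, function(k) {
  kmeans(boston_scaled, centers = k, nstart = 20, iter.max = 50)$tot.withinss
})

# relative decrease when going from k to k + 1 clusters
rel_drop <- -diff(twcss) / twcss[-k_max]
names(rel_drop) <- paste0(1:(k_max - 1), "->", 2:k_max)
print(round(rel_drop, 3))
```

For k = 1 the value equals the total sum of squares of the data, which for 506 standardized observations and 14 variables is 505 * 14 = 7070.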
Bonus section
Performing k-means on the original (standardized) Boston data:
# like above:
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled) # we need this too
# k-means clustering
km <- kmeans(boston_scaled, centers = 3) # trying with 3 clusters
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
Performing LDA using the clusters as target classes:
# not doing a train and test split here for it was not asked for
boston_scaled$cluster <- km$cluster
# linear discriminant analysis
# lda.fit <- lda(cluster ~ ., data = boston_scaled)
# Error in lda.default(x, grouping, ...) :
# variable 4 appears to be constant within groups
# -> run the model with all variables except for the fourth one
lda.fit <- lda(cluster ~ ., data = boston_scaled[,-4])
# print the lda.fit object
lda.fit
## Call:
## lda(cluster ~ ., data = boston_scaled[, -4])
##
## Prior probabilities of groups:
## 1 2 3
## 0.06916996 0.61067194 0.32015810
##
## Group means:
## crim zn indus nox rm age dis
## 1 -0.2048299 -0.1564737 0.2306535 0.3342374 0.3344149 0.3170678 -0.3634565
## 2 -0.3882449 0.2731699 -0.6264383 -0.5823006 0.2188304 -0.4585819 0.4807157
## 3 0.7847946 -0.4872402 1.1450405 1.0384727 -0.4896488 0.8062002 -0.8383961
## rad tax ptratio black lstat medv
## 1 -0.02700292 -0.1304164 -0.4453253 0.1787986 -0.1976385 0.6422884
## 2 -0.58641200 -0.6161585 -0.2814183 0.3151747 -0.4640135 0.3182241
## 3 1.12436056 1.2034416 0.6329916 -0.6397959 0.9277624 -0.7457491
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim 0.02479166 -0.13204141
## zn 0.42787622 -0.04638198
## indus 1.15646011 0.64348753
## nox 0.47943272 0.42987408
## rm 0.13637610 -0.12823804
## age -0.06654278 0.30029385
## dis -0.01915297 0.20848367
## rad 0.74637979 0.69845574
## tax 0.27967651 -1.02040695
## ptratio 0.19355485 -0.21964359
## black -0.04753224 0.11581547
## lstat 0.47213016 0.02172623
## medv 0.06797263 0.92531426
##
## Proportion of trace:
## LD1 LD2
## 0.9822 0.0178
Visualizing the results with a biplot:
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0,
                   x1 = myscale * heads[,choices[1]],
                   y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(boston_scaled$cluster)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
The variables affecting the LDs the most are (only the top two listed here) indus and rad for LD1, and tax and medv for LD2. The division is not as good or clear as in some other examples encountered in the exercises. Classes 1 and 2 seem to go together and class 3 to be separate, but there is overlap between them all.
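The most influential variables can also be read off by ranking the absolute values of the coefficients, which is reasonable here because all predictors are standardized. In the sketch below the coefficients are copied (rounded to three decimals) from the output printed above. In a live session `lda.fit$scaling` could be used in place of the hand-typed matrix. The names `coefs` and `top2` are illustrative.

```r
# coefficients of linear discriminants, copied from the printed model above
vars <- c("crim", "zn", "indus", "nox", "rm", "age", "dis",
          "rad", "tax", "ptratio", "black", "lstat", "medv")
coefs <- matrix(c( 0.025, -0.132,
                   0.428, -0.046,
                   1.156,  0.643,
                   0.479,  0.430,
                   0.136, -0.128,
                  -0.067,  0.300,
                  -0.019,  0.208,
                   0.746,  0.698,
                   0.280, -1.020,
                   0.194, -0.220,
                  -0.048,  0.116,
                   0.472,  0.022,
                   0.068,  0.925),
                ncol = 2, byrow = TRUE,
                dimnames = list(vars, c("LD1", "LD2")))

# for each discriminant, the two variables with the largest absolute coefficient
top2 <- apply(abs(coefs), 2, function(v) names(sort(v, decreasing = TRUE))[1:2])
print(top2)
```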
Super bonus section:
# scaled train data from above
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled) # we need this too
boston_scaled$crim <- as.numeric(boston_scaled$crim) # we need this too
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
# lda.fit
# contains LD1-LD3
# k-means clustering is also needed
# km <- kmeans(train, centers = 2) # 2 clusters here to match the exercise above, where we chose 2 as the optimal number
# Warning: NAs introduced by coercion
# Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
# we need to make crime a numeric column (it is a factor now)
train$crime <- as.numeric(train$crime)
km <- kmeans(train, centers = 2) # ok
# the example script copied here
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
# the original plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
# modifying the plot: Set the color to be the crime classes of the train set
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
# modifying the plot: color is defined by the clusters of the k-means
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)
The plots have some differences and similarities. In all of them there is some overlap between the groups, and having only 2 classes in k-means still doesn't differentiate the groups that well, even though the 3D plot shows two clear groups it could pick up on. In both plots the tighter cluster has mainly one colour/group/class label, whereas the sparser cluster has more.
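The visual comparison of the two colourings can be backed up with a cross-tabulation of crime class against k-means cluster. The sketch below differs from the code above in a few ways, all of them assumptions made for illustration: it fixes a seed, uses nstart = 20, and clusters on the predictors only, keeping crime as a factor instead of converting it to a number and including it in the clustering. Its counts therefore need not match the plots above.

```r
library(MASS)
data("Boston")

set.seed(2023)  # assumption: arbitrary seed for a repeatable split and clustering

# same preparation as above
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

n <- nrow(boston_scaled)
train <- boston_scaled[sample(n, size = floor(n * 0.8)), ]

# cluster on the predictors only; crime stays a factor
predictors <- train[, names(train) != "crime"]
km <- kmeans(predictors, centers = 2, nstart = 20)

# how the crime classes are distributed over the two clusters
agreement <- table(crime = train$crime, cluster = km$cluster)
print(agreement)
```

If one cluster consists almost entirely of a single crime class, that supports the impression from the 3D plots that the tighter group carries mainly one label.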